1 Sampling Design: Script overview

This chapter is designed for users with a basic understanding of R programming, and provides a comprehensive guide to implementing a soil sampling design for soil monitoring and mapping.

💡Note: Ensure you have the tinytex, knitr, rmarkdown, and rstudioapi R packages installed before running this script, so that R Markdown documents run and render successfully.
Additionally, run tinytex::install_tinytex() to enable LaTeX support for PDF generation.

1.1 Setting Up the Environment and Loading Libraries

First, set the working directory to the location of your R script. This ensures that all paths in the script are relative to the script’s location.

setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
setwd("../") # Move the working directory up to the main project folder
getwd()
## [1] "D:/FAO/github/SoilFER/sampling design"

1.2 Install required libraries

The script uses several R libraries, each serving a specific purpose. Below is a quick summary of the key functions of these libraries:

Summary of required R libraries and their purposes
Package Purpose
sp Handling spatial data and performing geographic operations.
terra Working with raster and vector spatial data, including transformations and analyses.
raster An older but still widely used library for handling raster data. The terra R package has been designed as its replacement.
sf Modern and efficient library for handling vector spatial data.
sgsR Structurally guided and classical sampling approaches (e.g., sample_srs) for raster data.
entropy Estimating entropy and related measures.
tripack Generating triangulations and handling Voronoi tessellations.
tibble An enhanced version of data frames with better usability.
manipulate Creating interactive visualizations.
dplyr Data manipulation and transformation.
synoptReg Clustering and Principal Component Analysis (PCA) for spatial data.
doSNOW Parallel computing for speeding up tasks.
Rfast Tools for faster computation of statistical methods.
fields Tools for spatial data interpolation and visualization.
ggplot2 Data visualization using the grammar of graphics.
rassta Raster spatial sampling and analysis tools.
snowfall Parallel computing framework for distributing computations efficiently across multiple cores or nodes.

Some libraries might need to be installed from GitHub before loading. Uncomment and run the lines below to install the remotes and then the synoptReg R packages. If you have not faced any issues loading all these libraries at once, there is no need to run the commented lines.

# Install synoptReg package from github
#install.packages("remotes") # Install remotes if not installed
#remotes::install_github("lemuscanovas/synoptReg")

packages <- c("sp", "terra", "raster", "sf", "sgsR", "entropy", 
              "tripack", "tibble", "manipulate", "dplyr", "synoptReg", 
              "doSNOW", "Rfast", "fields", "ggplot2", "rassta", "snowfall")

invisible(lapply(packages, library, character.only = TRUE))

rm(packages)

An alternative and beginner-friendly way of installing R packages is through the Packages tab in RStudio. First, locate this tab in the bottom-right pane of RStudio and click on it. Then, click the Install button at the top of the tab. In the dialog box that appears, type the name of the package (e.g., terra) in the text field. Make sure the Install dependencies box is checked, and finally, click Install to download and install the package.
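Before resorting to the Packages tab, a quick base-R check shows which of the required packages are still missing (the install line is left commented, following the opt-in convention used above):

```r
# List the required packages that are not yet installed
packages <- c("sp", "terra", "raster", "sf", "sgsR", "entropy",
              "tripack", "tibble", "manipulate", "dplyr", "synoptReg",
              "doSNOW", "Rfast", "fields", "ggplot2", "rassta", "snowfall")
to_install <- setdiff(packages, rownames(installed.packages()))
to_install
#install.packages(to_install) # Uncomment to install the missing packages
```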

1.3 Define variables and parameters

Several variables and parameters must be defined to establish a consistent framework for the script.

The ISO.code variable is used to define the unique Alpha-3 ISO code of the country of interest. This standard ensures consistency when naming variables and file paths throughout the script.

ISO.code <- "ZMB" 

💡Note: In this example, “ZMB” is used as the ISO code for Zambia.

The SoilFER project focuses on three primary land-use types: croplands (85%), grasslands (10%), and forests (5%). In this example, the landuse variable specifies croplands as the focus of analysis.

landuse <- "crops"

This section sets the file paths for accessing data such as raster files, shapefiles, and other resources. Additionally, it ensures the creation of a results folder for storing processed files related to the specified land use. Organizing file paths ensures that all resources and results are systematically stored and accessed.

# Path to data folders
raster.path <- "data/rasters/"
shp.path <- "data/shapes/"
other.path <- "data/other/"
landuse_dir <- paste0("data/results/", landuse, "/")
   
# Check if the landuse directory exists; if not, create it (including parent folders)
if (!dir.exists(landuse_dir)){dir.create(landuse_dir, recursive = TRUE)}

results.path <- landuse_dir

The epsg variable specifies the coordinate reference system used for spatial data. EPSG codes are globally recognized standards for geospatial datasets. If you do not have this information, you can look it up in the EPSG registry (e.g., at epsg.io).

epsg <- "EPSG:3857"

Users can calculate the total number of sampling sites based on predefined values. For example, out of 300 total sites, 80% are selected for analysis (share), as this percentage aligns with the proportion of croplands. The final number of sites is stored in nsites. Adjusting the number of sites ensures that only a subset of the data is analyzed. However, further ahead in this document, we propose a statistically sound approach to determining the optimal sample size.

nsites <- 300
share <- 0.80
nsites <- nsites * share
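The same arithmetic can be used to split a total budget across the three land-use shares mentioned above (a hypothetical allocation for illustration, not part of the script):

```r
# Hypothetical allocation of 300 total sites across the SoilFER land-use shares:
# croplands (85%), grasslands (10%) and forests (5%)
total_sites <- 300
shares <- c(crops = 0.85, grass = 0.10, forest = 0.05)
round(total_sites * shares)  # crops = 255, grass = 30, forest = 15
```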

Primary Sampling Units (PSUs) and Secondary Sampling Units (SSUs) are defined here, along with their sizes and the number of target/alternative SSUs. PSUs are large units of 2 km x 2 km used as a sampling framework, while SSUs are smaller units (1 hectare or 100 m x 100 m) nested within PSUs. Tertiary Sampling Units (TSUs) represent individual sampling points within SSUs.

# Define the number of PSUs to sample
# n.psu <- round(nsites/8)
  
# Define PSU and SSUs sizes 
psu_size <- 2000  # (default = 2km x 2 km)
ssu_size <- 100 # (default = 100m x 100m = 1 ha)
  
# Define number of target and alternative SSUs at each PSU
num_primary_ssus <- 4
num_alternative_ssus <- 4
    
# Define number of TSUs at each SSU
number_TSUs <- 3

The iterations variable defines how many times the clustering algorithm should run to optimize results. Increasing the number of iterations improves the accuracy of cluster assignment but also increases computation time; in practice, 10 iterations are sufficient.

iterations <- 10

The percent_crop variable specifies the minimum percentage of crop coverage required for a PSU to be included in the sampling design. This filter ensures that sampling efforts focus on areas with significant crop presence, improving relevance and efficiency.

# Define the minimum crop percent in selected PSUs
percent_crop <- 10 
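The effect of this threshold can be illustrated with a toy vector of crop percentages per PSU (hypothetical values, standing in for the real per-PSU statistics computed later):

```r
# Hypothetical illustration of the PSU filter: keep only PSUs whose crop
# coverage (in percent) meets the percent_crop threshold
percent_crop <- 10
psu_crop_pct <- c("PSU1" = 2.5, "PSU2" = 35.0, "PSU3" = 12.8)  # toy values
names(psu_crop_pct)[psu_crop_pct >= percent_crop]
# "PSU2" "PSU3"
```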

1.4 Functions

Covariate Space Coverage (CSC) focuses on ensuring that sampling points are well distributed in terms of environmental covariates, and is useful for scenarios where the region has heterogeneous environmental conditions.

The CSIS function is the backbone of clustering sampling units in the covariate space. It ensures optimal distribution of sampling units based on covariate data while considering any fixed legacy data. Here’s a step-by-step explanation:

Step-by-step Explanation:

  • Input Parameters:
    • fixed: A dataset of preselected (fixed) sampling points.
    • nsup: Number of additional (new) sampling points to be selected.
    • nstarts: Number of random starting points for the clustering algorithm.
    • mygrd: A grid of covariate data for the region of interest.
  • Workflow:
    • Initialization:
      • Extract the fixed points (fixed) and identify their indices (units).
      • Exclude the fixed points from the grid of covariates (mygrd_minfx).
    • Clustering:
      • Randomly initialize nsup new sampling points (centers_sup).
      • Combine the fixed points and new centers to form the initial cluster centers (centers).
    • Iterative Optimization:
      • Compute the distance between each point in mygrd and the cluster centers.
      • Assign each point in mygrd to the nearest cluster center.
      • Update the cluster centers by calculating the mean of the points assigned to each cluster.
      • Check for convergence: Stop if the cluster centers do not change significantly.
    • Evaluation:
      • Calculate the Mean Squared Shortest Scaled Distance (MSSSD) to evaluate the clustering quality.
      • Save the best clustering result among all nstarts attempts.
  • Output:
    • centers: The optimized cluster centers, including both fixed and new points.
    • cluster: The cluster assignment for each point in the grid.

💡Note: This function is particularly useful for creating spatially balanced sampling designs by considering existing sampling points and covariate distributions.

# Define Covariate Space Coverage function
# Clustering CSC function with fixed legacy data
CSIS <- function(fixed, nsup, nstarts, mygrd) {
  n_fix <- nrow(fixed)
  p <- ncol(mygrd)
  units <- fixed$units
  mygrd_minfx <- mygrd[-units, ]
  MSSSD_cur <- NA
  for (s in 1:nstarts) {
    units <- sample(nrow(mygrd_minfx), nsup)
    centers_sup <- mygrd_minfx[units, ]
    centers <- rbind(fixed[, names(mygrd)], centers_sup)
    repeat {
      D <- rdist(x1 = centers, x2 = mygrd)
      cluster <- apply(X = D, MARGIN = 2, FUN = which.min) %>% as.factor(.)
      centers_cur <- centers
      for (i in 1:p) {
        centers[, i] <- tapply(mygrd[, i], INDEX = cluster, FUN = mean)
      }
      
      #restore fixed centers
      centers[1:n_fix, ] <- centers_cur[1:n_fix, ]
      #check convergence
      sumd <- diag(rdist(x1 = centers, x2 = centers_cur)) %>% sum(.)
      if (sumd < 1E-12) {
        D <- rdist(x1 = centers, x2 = mygrd)
        Dmin <- apply(X = D, MARGIN = 2, FUN = min)
        MSSSD <- mean(Dmin^2)
        if (s == 1 | MSSSD < MSSSD_cur) {
          centers_best <- centers
          clusters_best <- cluster
          MSSSD_cur <- MSSSD
        }
        
        break
      }
      
    }
    
    print(paste0(s," out of ",nstarts))
  }
  
  list(centers = centers_best, cluster = clusters_best)
}
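The clustering quality criterion at the heart of CSIS can be sketched with base R alone, using dist() in place of fields::rdist() and toy random data:

```r
# Base-R illustration of the MSSSD criterion used inside CSIS
set.seed(42)
mygrd   <- matrix(rnorm(200), ncol = 2)      # 100 points in a 2-D covariate space
centers <- mygrd[sample(nrow(mygrd), 5), ]   # 5 candidate cluster centers
# Center-to-point distances (dist() stands in for fields::rdist())
D <- as.matrix(dist(rbind(centers, mygrd)))[1:5, -(1:5)]
Dmin  <- apply(D, 2, min)                    # shortest distance for every grid point
MSSSD <- mean(Dmin^2)                        # lower is better
```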

The generate_tsu_points_within_ssu function focuses on generating TSUs within SSUs. It ensures that TSUs are distributed randomly within the defined SSUs while adhering to specific land-use constraints. This function is crucial for creating a hierarchical sampling structure, where TSUs represent the smallest sampling units.

Step-by-step Explanation:

  • Input Parameters:
    • ssu: A SSU where TSUs will be generated.
    • number_TSUs: Number of TSUs to be sampled within the SSU.
    • index: Index of the SSU within the PSU grid.
    • ssu_type: The type of SSU (e.g., Target or Alternative).
    • crops: A raster layer defining the land-use constraints, ensuring TSUs fall within the desired land-use area (e.g., cropland).
  • Workflow:
    • SSU Masking:
      • Convert the selected SSU to a spatial vector (SpatVector) format for masking.
      • Clip the land-use raster (crops) to the boundaries of the SSU, creating a localized raster for sampling.
    • TSU Sampling:
      • Use the sample_srs function to randomly generate TSU points within the clipped raster area. This ensures TSUs are distributed across the SSU while respecting land-use constraints.
      • If the number of sampled points is less than the desired number (number_TSUs), repeat the sampling process using the spatSample function until the required number of TSUs is obtained.
    • Metadata Assignment:
      • Assign metadata to the sampled TSUs, including:
        • PSU_ID: The PSU ID containing the SSU.
        • SSU_ID: The current SSU ID.
        • TSU_ID: A sequential identifier for each TSU within the SSU.
        • SSU_Type: The type of SSU (e.g., Target or Alternative).
        • TSU_Name: A unique name for each TSU, combining PSU, SSU, and TSU identifiers.
  • Output:
    • Returns a dataset containing the randomly sampled TSU points with metadata.

💡Note: This function ensures spatial consistency by sampling TSUs only within specified land-use areas, making it ideal for hierarchical sampling designs.

generate_tsu_points_within_ssu <- function(ssu, number_TSUs, index, ssu_type, crops) {
  # Select the SSU polygon from the SSU grid (sf object) by its index
  ssu_vect <- ssu_grid_sf[rownames(ssu_grid_sf) == index, ]
  # Clip the spatRaster to the SSU to focus the sampling within the SSU boundaries
  clipped_lu <- terra::crop(crops, ssu_vect)
  # Generate random points within the clipped spatRaster using sample_srs
  sampled_points <- sample_srs(clipped_lu, nSamp = number_TSUs)  # Ensure this is compatible or modify
  # Make sure the TSUs fall within land-use areas
  while (nrow(sampled_points) < number_TSUs) {
    sampled_points <- spatSample(clipped_lu, size = number_TSUs)
  }
  
  # Add metadata to the sampled points
  sampled_points$PSU_ID <- selected_psu$ID
  sampled_points$SSU_ID <- index
  sampled_points$TSU_ID <- seq_len(nrow(sampled_points))
  sampled_points$SSU_Type <- ssu_type
  sampled_points$TSU_Name <- paste0(ISO.code, sprintf("%04d", sampled_points$PSU_ID),"-",sampled_points$SSU_ID,"-",seq_len(nrow(sampled_points)))
  return(sampled_points)
}
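The TSU_Name convention built inside the function can be reproduced on its own (toy PSU and SSU identifiers):

```r
# Illustration of the TSU naming convention:
# ISO code + zero-padded PSU ID + "-" + SSU ID + "-" + sequential TSU index
ISO.code <- "ZMB"
paste0(ISO.code, sprintf("%04d", 12), "-", 3, "-", 1:3)
# "ZMB0012-3-1" "ZMB0012-3-2" "ZMB0012-3-3"
```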

1.5 Country boundaries and legacy data

In this step, file paths for country boundaries and legacy soil data are defined using the file.path function. These paths are used to locate and load the corresponding shapefiles. After that, each shapefile is loaded with st_read, converted to an sf object with st_as_sf, and transformed with st_transform to the pre-defined CRS, using the EPSG code provided in the earlier steps and the sf package.

# Define location of country boundaries
country_boundaries <- file.path(paste0(shp.path,"roi_epsg_3857.shp"))
# Define location of soil legacy data
legacy <- file.path(paste0(shp.path,"zmb_legacy_v2_clipped_epsg_3857.shp"))

# Load and transform the country boundaries
country_boundaries <- sf::st_read(country_boundaries, quiet=TRUE)

if(sf::st_crs(country_boundaries)$epsg!=epsg){
  country_boundaries <- sf::st_transform(country_boundaries, crs = epsg)
}

Legacy soil data is loaded if the file exists (i.e., if the country has such data). If the data is in a different CRS, it is transformed to match the CRS used for the country boundaries.

# Load legacy data (if it exists)
if(file.exists(legacy)){
  legacy <- sf::st_read(legacy, quiet=TRUE)
  # Transform coordinates to the common projection system
  if(sf::st_crs(legacy)$epsg!=epsg){
    legacy <- sf::st_transform(legacy, crs=epsg)
  }
  
  } else {
    # If legacy data does not exist delete the object "legacy"
    rm(legacy)
  }

💡Note: Always verify that all geospatial datasets use the same CRS before performing any spatial operations. Misaligned CRSs can lead to incorrect analyses.

Finally, the ggplot2 package is used to visualize the country boundaries and overlay the legacy soil data. This ensures the data was loaded and transformed correctly.

ggplot(data = country_boundaries) + 
  geom_sf() + 
  geom_sf(data = legacy, aes(geometry = geometry)) +
  theme_minimal()

1.6 Environmental covariates for selecting PSUs

The additional environmental covariates include information on non-protected areas, a high-slope mask, geology, and geomorphology. These inputs are optional: if a country does not have them, the workflow below skips them and the selection of sampling units proceeds without them.

Each explanatory dataset is specified as a file path. This approach ensures clear organization of data sources. The variables geo.classes and geomorph.classes are defined to identify fields within the geology and geomorphology datasets that store their respective types/classes.

# Non-protected areas (raster)
npa <- file.path(paste0(shp.path,"zmb_national_parks_zari_clipped_epsg_3857.shp"))
# High slopes (binary raster with 1 = slope < 50% and NA = slope >= 50%) - QGIS Calculator
slope <- file.path(paste0(raster.path,"zmb_clipped_slope_mask_epsg_3857.tif")) 
# Geology
geo <- file.path(paste0(shp.path,"zmb_geology_clipped_epsg_3857.shp"))
# Field for geologic classes in the shape
geo.classes <- "GEO"
# Geomorphology
geomorph <- file.path(paste0(raster.path,"zmb_clipped_Geomorphon.tif"))
# Field for geomorphology classes in the shape
geomorph.classes <- "CODR"

After defining the locations of those files, it is time to load them.

Non-protected areas (NPAs) can be represented as either shapefiles or raster files. The provided code processes shapefiles directly. If the dataset is in raster format, some lines must be commented or uncommented accordingly to handle raster operations. The primary goal is to calculate the difference between the country boundary and the protected areas, isolating non-protected areas for use in subsequent steps.

# Non-protected areas (if it exists)
if(file.exists(npa)){
  npa <- sf::st_read(npa, quiet = FALSE) # Import Protected area
  npa <- sf::st_union(npa)
  npa <- sf::st_difference(country_boundaries, npa)
  #npa <- rast(npa) 
  if(sf::st_crs(npa)$epsg !=epsg){
    npa <- sf::st_transform(npa, crs = epsg)
    #npa <- project(npa, epsg, method="near")
  }
  
    } else {
      # If non-protected areas data does not exist delete the object "npa"
      rm(npa)
    }

The slope mask represents areas with a slope less than 50%. This dataset is typically generated using GIS software such as QGIS. After loading, the mask values are standardized to 0 or 1. This step ensures that only areas meeting the slope criteria are included in subsequent analyses. If slope data is missing, the variable is removed.

# Slopes < 50% raster (if it exists)
if(file.exists(slope)){
  slope <- rast(slope)
  if(!terra::same.crs(slope, epsg)){
    slope <- project(slope, epsg, method="near")
  }
  slope <- slope/slope
  } else {
    # If slope data does not exist delete the object "slope"
    rm(slope)
  }

The geology dataset is a shapefile containing geological classes. If the dataset exists, it is loaded, and its CRS is transformed for consistency. The geological classes are then converted to numeric values for compatibility with subsequent processing steps.

# Geology (if it exists)
if(file.exists(geo)){
  geo <- sf::st_read(geo, quiet=TRUE)
  if(sf::st_crs(geo)$epsg !=epsg){
    geo <- sf::st_transform(geo, crs=epsg)
  }
  geo$GEO <- as.numeric(as.factor(geo[[geo.classes]]))
  } else {
    # If geology data does not exist delete the object "geo"
    rm(geo)
  }

Geomorphology data can be provided as either a raster or shapefile. The code handles both formats by checking the file extension. If the data is a raster, it is loaded directly. If it is a shapefile, the CRS is transformed, and the geomorphology classes are converted to numeric values. If the file is neither a raster nor a shapefile, a warning is issued.

# Geomorphology (if it exists)
if (file.exists(geomorph)) {
  # Check the file extension to determine if it is a raster (.tif) or shapefile (.shp)
  file_extension <- tools::file_ext(geomorph)
  if (file_extension == "tif") {
    # If it's a TIFF file, no processing is needed
    geomorph <- rast(geomorph)
    names(geomorph) <- 'GEOMORPH'
    # No further action required for TIFF files
    if(!terra::same.crs(geomorph, epsg)){
      geomorph <- project(geomorph, epsg, method="near")
    }
    } else if (file_extension == "shp") {
      # If shapefile, execute the shapefile processing code
      geomorph <- sf::st_read(geomorph, quiet = TRUE)
      # Check and transform CRS if necessary
      if (sf::st_crs(geomorph)$epsg != epsg) {
        geomorph <- sf::st_transform(geomorph, crs = epsg)
      }
      geomorph$GEOMORPH <- as.numeric(as.factor(geomorph[[geomorph.classes]]))
      } else {
        # If the file is neither a raster nor a shapefile, handle the error or delete the object
        warning("File format not recognized. Expected .tif or .shp.")
        rm(geomorph)
      }
  } else {
    # If the file does not exist, delete the object "geomorph"
    rm(geomorph)
  }

A list of mandatory environmental covariates was created to serve as parameters for selecting sampling units, providing a baseline for countries lacking high-resolution environmental data. If available, countries may use their own datasets as substitutes.

The following steps outline how covariates are loaded, processed, and prepared for analysis. Let's take a look at them.

The list.files() function is used to locate raster files containing covariate data, and these files are loaded into a SpatRaster object using terra::rast(). We have displayed in a table the covariate names for easy reference and interpretation.

# Load covariate data
cov.dat <-  list.files(raster.path, pattern = "covs_zam_clipped.tif$",  recursive = TRUE, full.names = TRUE)
cov.dat <- terra::rast(cov.dat) # SpatRaster from terra

# Convert the variable names into a matrix with 3 columns
cov.dat.names <- matrix(names(cov.dat), ncol = 3, byrow = TRUE)

# Create a markdown table
knitr::kable(cov.dat.names, format = "markdown",  
             caption = "Environmental covariates")
Environmental covariates
bio1 bio12 bio13
bio14 bio16 bio17
bio5 bio6 ngd10
pet_penman_max pet_penman_mean pet_penman_min
pet_penman_range sfcWind_max sfcWind_mean
sfcWind_range fpar_030405_500m_mean fpar_030405_500m_sd
fpar_060708_500m_mean fpar_060708_500m_sd fpar_091011_500m_mean
fpar_091011_500m_sd fpar_120102_500m_mean fpar_120102_500m_sd
lstd_030405_mean lstd_030405_sd lstd_060708_mean
lstd_060708_sd lstd_091011_mean lstd_091011_sd
lstd_120102_mean lstd_120102_sd ndlst_030405_mean
ndlst_030405_sd ndlst_060708_mean ndlst_060708_sd
ndlst_091011_mean ndlst_091011_sd ndlst_120102_mean
ndlst_120102_sd ndvi_030405_250m_mean ndvi_030405_250m_sd
ndvi_060708_250m_mean ndvi_060708_250m_sd ndvi_091011_250m_mean
ndvi_091011_250m_sd ndvi_120102_250m_mean ndvi_120102_250m_sd
snow_cover swir_060708_500m_mean crops
flooded_vegetation grass shrub_and_scrub
trees dtm_curvature_250m dtm_downslopecurvature_250m
dtm_dvm2_250m dtm_dvm_250m dtm_elevation_250m
dtm_mrn_250m dtm_neg_openness_250m dtm_pos_openness_250m
dtm_slope_250m dtm_tpi_250m dtm_twi_500m
dtm_upslopecurvature_250m dtm_vbf_250m bio1

💡Note: The initial stack contains 68 environmental covariates.

The newhall dataset contains soil climate variables like temperature and moisture regimes. Subdivisions that are not required (regimeSubdivision1 and regimeSubdivision2) are removed to optimize processing.

# Load soil climate data
newhall <-  list.files(raster.path, pattern = "newhall_zam_clipped.tif$",  recursive = TRUE, full.names = TRUE)
newhall <- terra::rast(newhall) # SpatRaster from terra

# Convert the variable names into a matrix with 3 columns
newhall.names <- matrix(names(newhall), ncol = 3, byrow = TRUE)

# Create a markdown table
knitr::kable(newhall.names, format = "markdown",  
             caption = "Environmental covariates")

# Delete regimeSubdivisions (not needed)
newhall$regimeSubdivision1 <- c()
newhall$regimeSubdivision2 <- c()
Environmental covariates
annualRainfall waterHoldingCapacity annualWaterBalance
annualPotentialEvapotranspiration summerWaterBalance dryDaysAfterSummerSolstice
moistDaysAfterWinterSolstice numCumulativeDaysDry numCumulativeDaysMoistDry
numCumulativeDaysMoist numCumulativeDaysDryOver5C numCumulativeDaysMoistDryOver5C
numCumulativeDaysMoistOver5C numConsecutiveDaysMoistInSomeParts numConsecutiveDaysMoistInSomePartsOver8C
temperatureRegime moistureRegime regimeSubdivision1
regimeSubdivision2 annualRainfall waterHoldingCapacity

💡Note: The final number of soil climate variables is 17.

For subsequent statistical analysis, categorical covariates must be converted into binary (dummy) layers to ensure compatibility with spatial clustering. Following this, the original covariates temperatureRegime and moistureRegime are removed from newhall.

# Convert temperatureRegime and moistureRegime to dummy variables (1 = presence; 0 = absence)
temperatureRegime <- dummies(ca.rast = newhall$temperatureRegime, preval = 1, absval = 0)
moistureRegime <- dummies(ca.rast = newhall$moistureRegime, preval = 1, absval = 0)

# Delete temperatureRegime and moistureRegime rasters from soil climate data
newhall$temperatureRegime <- c()
newhall$moistureRegime <- c()
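What the dummies() calls do can be sketched in base R on a toy factor (the real calls operate on a categorical SpatRaster, producing one binary layer per class):

```r
# Base-R sketch of dummy (one-hot) coding: one 0/1 column per class
regime <- factor(c("ustic", "udic", "ustic", "aridic"))  # toy moisture regimes
dummy_mat <- sapply(levels(regime), function(l) as.integer(regime == l))
dummy_mat  # 4 observations x 3 class columns; each row sums to 1
```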

The processed soil climate data is appended to the main covariate stack (cov.dat). If the dataset uses a different coordinate reference system (CRS), it is reprojected to match epsg.

# Merge covs and climate data
cov.dat <- c(cov.dat, newhall,temperatureRegime,moistureRegime)
    
# Project covariates if necessary
if(!terra::same.crs(cov.dat, epsg)){
  cov.dat <- terra::project(cov.dat, epsg, method="near")
}

The loaded geology data is converted into raster layers and dummy variables to ensure it can be included in spatial clustering and analysis.

# Transform Geology if exists
if(exists("geo")){
  geo <- rasterize(as(geo,"SpatVector"), cov.dat,field="GEO")
  geo <- dummies(ca.rast = geo$GEO, preval = 1, absval = 0)
  } else {
    # If Geology data does not exist delete the object "geo"
    rm(geo)
  }

# Merge covs and geology (if exists)
if(exists("geo")){
  cov.dat <- c(cov.dat, geo)
} 

The same process is applied to geomorphology data, ensuring compatibility with the other covariates.

# Transform geomorphology if exists and 
# Check if the 'geomorph' object exists
if (exists("geomorph")) {
  # Check if 'geomorph' is already a raster
  if (!inherits(geomorph, "SpatRaster")) {
    # If 'geomorph' is not a raster, convert it to a raster
    geomorph <- rasterize(as(geomorph, "SpatVector"), cov.dat, field = "GEOMORPH")
  }
  # Apply the dummies function, assuming geomorph is now a raster
  geomorph <- dummies(ca.rast = geomorph$GEOMORPH, preval = 1, absval = 0)
}

# Merge covs and geomorph (if exists)
if (exists("geomorph")) {
  # Ensure both rasters have the same resolution and extent
  if (!identical(ext(cov.dat), ext(geomorph))) {
    geomorph <- extend(geomorph, cov.dat)
  }
  # Optionally, if they have different resolutions, resample geomorph to match cov.dat
  if (!all(res(cov.dat) == res(geomorph))) {
    geomorph <- resample(geomorph, cov.dat, method = "near")
  }
  # Merge cov.dat and geomorph
  cov.dat <- c(cov.dat, geomorph)
}

After stacking, intermediate datasets like newhall, geomorph, and geo are removed using the rm() function. This is done to free up memory and avoid unnecessary resource usage. By stacking all the relevant layers into cov.dat and cleaning up temporary datasets, the workflow remains efficient, organized, and ready for subsequent analyses like Principal Component Analysis (PCA) or sampling unit selection.

# Remove newhall SpatRaster and clean memory
rm(newhall)
rm(geomorph)
rm(geo)
gc()
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 3339657 178.4    5251875 280.5  5251875 280.5
## Vcells 4688291  35.8   10146329  77.5  7365684  56.2

💡Note: Stacking is the process of combining multiple spatial data layers (raster datasets) into a single multi-layer object. In the context of geospatial analysis, this is especially useful when you have multiple variables (e.g., environmental covariates, soil properties, or climate data) that need to be analyzed together as a single dataset.

Covariates are cropped to align with the administrative boundary of the target region. The processed dataset is saved as a raster stack for later use.

# Crop covariates on administrative boundary
cov.dat <- terra::crop(cov.dat, country_boundaries, mask=TRUE, overwrite=TRUE)
#writeRaster(cov.dat, paste0(raster.path,"cov_dat_stack.tif"),overwrite=TRUE)

# Load cov.dat again, in case of issues
#cov.dat <- rast(paste0(raster.path,"cov_dat_stack.tif"))

PCA is used to reduce the dimensionality of the dataset, retaining only the most important components while removing redundancy. Components are kept up to the first one at which the cumulative explained variance exceeds 99%.

# Simplify raster information with PCA  
pca <- synoptReg::raster_pca(cov.dat)  # Faster than with terra::princomp

# Get SpatRaster layers from the pca object
cov.dat <- pca$PCA
#summary(cov.dat)

# Subset rasters to main PC (var. explained >= 0.99)
n_comps <- first(which(pca$summaryPCA[3,] > 0.99))
cov.dat <- pca$PCA[[1:n_comps]]
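The selection rule can be sketched with base R's prcomp() on toy data (the script itself uses synoptReg::raster_pca(), whose summaryPCA row 3 holds the cumulative proportion of variance):

```r
# Base-R sketch of the 99% cumulative-variance rule
set.seed(1)
X <- matrix(rnorm(500), ncol = 5)             # toy data: 100 observations, 5 variables
p <- prcomp(X, scale. = TRUE)
cumvar  <- cumsum(p$sdev^2) / sum(p$sdev^2)   # cumulative variance explained
n_comps <- which(cumvar >= 0.99)[1]           # first component reaching 99%
```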

The processed PCA rasters can be exported for future analysis, or as a backup in case of issues.

# Save PCA rasters to results folder
#writeRaster(cov.dat,paste0(results.path,"PCA_projected.tif"),overwrite=TRUE)

# Remove pca stack
rm(pca)

# Load pretreated PCA rasters
#cov.dat <- rast(paste0(results.path,"PCA_projected.tif")) 

1.7 Organizing the Sampling Universe

After incorporating all environmental covariates and performing a Principal Component Analysis (PCA), additional steps are taken to refine the sampling universe by removing protected areas, filtering areas with slopes up to 50%, resampling land use data, and filtering legacy data.

This step defines the file path for the land-use raster, ensuring that only areas corresponding to the target land-use (e.g., croplands) are considered in the analysis. The land-use raster acts as a key layer for filtering and refining the sampling universe.

landuse <- file.path(paste0(raster.path,"cropland_clipped_zmb_v1_epsg_3857.tif"))

The land-use layer is a binary raster where 1 represents the land-use of interest and NA represents all others. The operation crops/crops divides each cell value by itself: for every cell that is not NA (i.e., land-use cells with valid data), the division results in 1, while NA cells remain NA.

In this example, the current resolution is approximately 20 x 21 m (x, y).

crops <- rast(landuse)
crops <- crops/crops  
names(crops) <- "lu"
crops
## class       : SpatRaster 
## dimensions  : 6126, 8296, 1  (nrow, ncol, nlyr)
## resolution  : 20.61473, 21.17437  (x, y)
## extent      : 3375972, 3546991, -1626000, -1496286  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / Pseudo-Mercator (EPSG:3857) 
## source(s)   : memory
## varname     : cropland_clipped_zmb_v1_epsg_3857 
## name        : lu 
## min value   :  1 
## max value   :  1
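The NA-preserving division described above can be reproduced on a plain numeric vector:

```r
# x/x returns 1 wherever x has non-zero data and NA where x is NA
x <- c(4, NA, 0.2)
x / x
# 1 NA 1  (note: a 0 cell would become NaN, so this trick assumes no zeros)
```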

Protected areas are excluded from the analysis to ensure that sampling units are not located within legally protected zones. If the npa variable exists, the mask() function is applied to exclude these regions from the crops raster. Once processed, npa is removed to free memory.

If npa is a raster instead of a shapefile, uncomment the appropriate lines in the code and comment out the ones for shapefiles.

if(exists("npa")){
  # If npa is a shapefile
  crops <- terra::mask(crops, npa)
  # If npa is a raster. Resample to match crops
  # npa <- resample(npa,crops, method="near")
  # crops <- crops * npa
}
rm(npa)

💡Note: Use the correct operation for your npa dataset type (shapefile or raster) to ensure compatibility and accurate processing.

Restricting the analysis to areas with slopes of 50% or less ensures that sampling focuses on accessible and agriculturally viable regions. Steep slopes are excluded as they are often unsuitable for farming or inaccessible.

The slope raster is resampled to align with the resolution of the crops raster before being used to filter out areas with slopes above 50%. The filtered raster ensures the focus remains on usable land.

if(exists("slope")){
  # Resample to match crops
  slope <- resample(slope, crops, method="near")
  crops <- crops * slope
}

rm(slope)

Land-use is resampled to a coarser 100 m resolution, significantly speeding up processing and reducing storage requirements. This is achieved by aggregating the 20 m resolution raster to a 1-hectare resolution (100 m x 100 m). Users can take advantage of multi-core processing to speed up this step.

The aggregate() function increases pixel size, where a 5x5 grid of 20 m pixels is combined into one 100 m pixel using the modal value. If we had a 25 m resolution raster, the number would be 4 instead of 5. The cores argument specifies the number of cores to parallelize the process for faster computation.
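As a base-R sketch of the modal rule (the real call lets terra compute this for every 5x5 block of pixels):

```r
# Modal value of one toy 5-pixel block; table() drops NA values by default,
# mirroring na.rm = TRUE in the aggregate() call
block <- c(1, 1, NA, 1, 2)
as.numeric(names(which.max(table(block))))
# 1
```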

lu <- aggregate(crops, 5, fun=modal, cores = 4, na.rm=T)
lu <- lu/lu  # convert all non-NA cells to 1, creating a presence mask
lu
## class       : SpatRaster 
## dimensions  : 1226, 1660, 1  (nrow, ncol, nlyr)
## resolution  : 103.0736, 105.8718  (x, y)
## extent      : 3375972, 3547074, -1626085, -1496286  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / Pseudo-Mercator (EPSG:3857) 
## source(s)   : memory
## name        : lu 
## min value   :  1 
## max value   :  1

Processed land-use data is saved to ensure data persistence and allow for reuse in future analyses or in case of system interruptions. This is an essential step for safeguarding progress and avoiding re-processing.

The writeRaster() function saves the processed land-use raster to disk, and the rast() function reloads it when needed.

#writeRaster(lu,paste0(raster.path,"cropland_zmb_100m_v1_epsg_3857.tif"),overwrite=TRUE)

lu <- rast(paste0(raster.path,"cropland_zmb_100m_v1_epsg_3857.tif"))

Legacy data points are filtered to retain only those that fall within the designated land-use areas. This step ensures that sampling is consistent with the land-use data and improves the relevance of legacy data for the current analysis.

The terra::extract() function checks whether each legacy point falls within the land-use raster, marking valid points with a corresponding value. Points outside the designated areas are removed.

if (exists("legacy")){
  legacy$INSIDE <- terra::extract(crops, legacy) %>% dplyr::select(lu)
  legacy <- legacy[!is.na(legacy$INSIDE),] %>%
    dplyr::select(-"INSIDE")
}

1.8 Creating Primary Sampling Units

In this section, we generate a grid of Primary Sampling Units (PSUs). PSUs are 2x2 km grid cells that serve as the foundational sampling units.

The st_make_grid() function is used to create a grid of 2x2 km square polygons within the boundaries of the study region. The size of the grid cells is determined by the psu_size variable defined at the beginning of this script, and each grid cell is assigned a unique identifier (ID).

psu_grid <- st_make_grid(country_boundaries, cellsize = c(psu_size, psu_size), square = TRUE)
psu_grid <- st_sf(geometry = psu_grid)
psu_grid$ID <- 1:nrow(psu_grid)

The next step is to trim the grid to the administrative boundaries of the study region. This ensures that only PSUs within the defined country boundaries are retained. This operation is computationally intensive, especially for large datasets.

psu_grid <- psu_grid[country_boundaries[1],]

The processed PSU grid is saved to a shapefile using the write_sf() function. Saving the grid ensures that the trimmed version is available for future use without needing to repeat the computationally expensive trimming process. The st_read() function reloads the saved shapefile as needed.

#write_sf(psu_grid,paste0(results.path,"../grid2k.shp"),overwrite=TRUE)

psu_grid <- sf::st_read(file.path(paste0(results.path,"../grid2k.shp"))) 

1.9 Select PSUs with land-use above a certain percentage

This section focuses on identifying Primary Sampling Units (PSUs) that meet a minimum threshold for crop coverage. It ensures that sampling efforts are concentrated in areas with sufficient agricultural activity, improving the relevance of the collected data.

The land-use values (lu) corresponding to the cells intersecting the psu_grid polygons are extracted. These values represent crop coverage data for each PSU grid cell.

extracted_values <- terra::extract(lu, psu_grid)

The extracted land-use data is grouped by grid ID (ID), and the percentage of crop coverage is calculated based on the 1-ha reclassification of land-use (400 pixels represent a 2x2 km PSU).

The temporary object extracted_values is removed to save memory.

crop_perc <- extracted_values %>%
  group_by(ID) %>%
  summarize(crop_perc = sum(lu, na.rm = TRUE)*100/400)

rm(extracted_values)

💡Note: The formula sum(lu, na.rm = TRUE) * 100 / 400 ensures that the crop percentage is calculated accurately, with 400 being the total pixels in a 2x2 km PSU.
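The formula can be checked on a toy vector. The counts below are hypothetical, purely to illustrate the calculation for a single PSU:

```r
# Suppose 180 of the 400 one-hectare pixels in a 2x2 km PSU are cropland
lu_values <- c(rep(1, 180), rep(NA, 220))
sum(lu_values, na.rm = TRUE) * 100 / 400   # 45 (% crop coverage)
```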

The calculated crop coverage percentages (crop_perc) are added to the psu_grid polygons, creating a new attribute for each grid. This step prepares the data for further filtering and visualization.

psu_grid$crop_perc <- crop_perc$crop_perc

Like before, the updated psu_grid with crop coverage percentages is saved to a shapefile for backup and potential reuse in future analyses. If needed, the saved file can be reloaded.

#write_sf(psu_grid, file.path(paste0(results.path,"/psu_grid_counts.shp")), overwrite=TRUE)

psu_grid <- sf::st_read(file.path(paste0(results.path,"/psu_grid_counts.shp"))) 

Only PSUs meeting the predefined crop coverage threshold (percent_crop) are retained. The percent_crop variable is defined earlier in the script, allowing flexibility in setting the threshold. This filtering step narrows the focus to areas that are agriculturally relevant.

psu_grid <- psu_grid[psu_grid$crop_perc > percent_crop,"ID"]

The selected PSUs are rasterized, creating a template where each grid cell represents a specific PSU ID. This raster format is essential for integrating PSUs with covariate data and subsequent spatial analyses.

template <- rast(vect(psu_grid), res = psu_size)
template <- rasterize(vect(psu_grid), template, field = "ID")

1.10 Rasterising PSUs for Covariate Space Coverage

The covariate stack (cov.dat) is cropped to the sampling universe defined by the PSU grid. The cropped covariates are then resampled to match the rasterized PSU template, ensuring consistent spatial alignment.

cov.dat <- terra::crop(cov.dat, psu_grid, mask=TRUE, overwrite=TRUE)

PSU.r <- resample(cov.dat, template)

💡Why this matters: Cropping and resampling ensure that the spatial resolution and extent of covariates match the PSU grid, facilitating accurate spatial analysis and clustering.

1.11 Computing optimal sample size for the PSUs

After rasterising the PSUs (PSU.r) for covariate space coverage, it is essential to calculate the optimal sample size for the sampling design and future digital soil mapping based on divergence metrics. The method follows Saurette et al. (2023), "Divergence metrics for determining optimal training sample size in digital soil mapping" (DOI: 10.1016/j.geoderma.2023.116553).

The source scripts are available on GitHub. We begin by loading the script that contains the function for optimal sample size calculation.

source("scripts/opt_sample.R")

Here, we convert the rasterised PSU into a data frame for further processing. The following parameters are then set:

  • Minimum sample size: 50
  • Maximum sample size: 3000
  • Incrementing step: 25
  • Number of iterations per sampling trial: 4

# Prepare covariate data
psu.r.df <- data.frame(PSU.r)

# Define the minimum, maximum sample sizes, incrementing step and iterations for sampling trials
initial.n <- 50
final.n <- 3000
by.n <- 25
iters <- 4

Using the opt_sample function, we determine the optimal number of sampling sites based on divergence metrics. The alg = "fcs" argument selects feature (covariate) space coverage sampling; the default is conditioned Latin hypercube sampling (clhs). It is also important to understand the metrics used to select the optimal number of samples:

  • Kullback-Leibler Divergence (KLD): A measure of how one probability distribution differs from another. It tells us how much information is “lost” when we approximate one distribution with another.

  • Jensen-Shannon Divergence (JSD): A symmetrized and smoothed version of the Kullback-Leibler divergence. Unlike the KLD, the JSD is always finite and symmetric.

  • Jensen-Shannon Distance (JS Distance): The square root of the Jensen-Shannon divergence, which makes it a true mathematical distance.

Saurette et al. (2023) pointed out that this method overestimates the optimal sample size. For this reason, we DO NOT use the JS distance or the normalized JSD.
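To make these metrics concrete, here is a minimal base-R illustration of the KLD and JSD for two discrete probability distributions. The toy values are illustrative only and not part of the script:

```r
p <- c(0.1, 0.4, 0.5)
q <- c(0.2, 0.3, 0.5)

kld <- function(p, q) sum(p * log(p / q))   # asymmetric: kld(p, q) != kld(q, p)

jsd <- function(p, q) {
  m <- (p + q) / 2
  0.5 * kld(p, m) + 0.5 * kld(q, m)         # symmetric and always finite
}

js_distance <- sqrt(jsd(p, q))              # a true mathematical distance
```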

# Calculate optimal sample size using normalized KL-div, JS-div and JS distance
opt_N_fcs <-  opt_sample(alg="fcs", 
                         s_min=initial.n, 
                         s_max=final.n, 
                         s_step=by.n, 
                         s_reps=iters, 
                         covs = psu.r.df, 
                         cpus=NULL, 
                         conf=0.95)
## R Version:  R version 4.4.0 (2024-04-24 ucrt) 
## 
## Library stats loaded.
## Library fields loaded.
## [1] "COMPUTING DIVERGENCE METRICS FOR  26  SAMPLE SIZES * 4  REPETITIONS =  104"
## ================================================================================

opt_N_fcs$optimal_sites
##                   method Sites
## 1         Normalized KLD   424
## 2         Normalized JSD   423
## 3 Normalized JS Distance  1172
optimal_N_KLD <- opt_N_fcs$optimal_sites[1,2]

The final number of PSUs was calculated for the specific land-use class and is stored in n.psu. Users can manually adjust the number of sites to reflect their needs. Note that this count refers to PSUs: since each PSU is expected to include four samples, the optimal number of target sites will be 424 × 4 = 1696.

n.psu <- optimal_N_KLD
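As a quick sanity check, the total number of target sites follows directly from the PSU count and the four samples planned per PSU. The names below are illustrative, not script variables:

```r
n_psu_example   <- 424   # optimal PSU count from the previous step
samples_per_psu <- 4     # samples planned per PSU in this design
n_psu_example * samples_per_psu   # 1696 target sites
```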

1.12 Covariate Space Coverage - Computing PSUs

Covariate Space Coverage (CSC) sampling is a clustering technique designed to ensure that sampling units are distributed effectively across a multidimensional space of environmental covariates. Covariates are standardized (centered and scaled), so their mean is 0 and standard deviation is 1. This prevents variables with larger ranges (e.g., elevation vs. precipitation) from dominating the clustering process. The algorithm groups raster cells in covariate space into clusters by minimizing the Mean Squared Shortest Scaled Distance (MSSSD). MSSSD is a measure of how well the points in the dataset fit into their assigned clusters. The term “scaled distance” refers to distances measured in the space of standardized covariates, not physically scaled distances.
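The MSSSD can be illustrated on toy data. This sketch mirrors the computation used later in the script (fields::rdist for center-to-point distances) but runs on simulated covariates rather than the actual PSU data:

```r
library(fields)

set.seed(42)
X <- scale(matrix(rnorm(200), ncol = 2))   # 100 points, 2 standardized covariates
km <- kmeans(X, centers = 5, nstart = 10)  # cluster in covariate space

D <- rdist(x1 = km$centers, x2 = X)        # 5 x 100 center-to-point distance matrix
dmin <- apply(D, MARGIN = 2, FUN = min)    # shortest scaled distance per point
MSSSD <- mean(dmin^2)                      # mean squared shortest scaled distance
```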

Preparing function parameters

Before running the CSC function, the raster stack information at the PSU aggregated level is converted into a dataframe with coordinates. This transformation allows for easier manipulation and scaling of covariates.

PSU.df <- as.data.frame(PSU.r,xy=T)

covs <- names(cov.dat)

Scaling covariates

Environmental covariates are scaled to standardize their values, ensuring all variables contribute equally during clustering. The dataframe mygrd represents the scaled covariates at a resolution of 2 km x 2 km (PSUs).

mygrd <- data.frame(scale(PSU.df[, covs]))

Performing k-means clustering for CSC

If legacy data is available, it is used to define fixed cluster centers and enhance the sampling design; otherwise, a standard k-means clustering approach is used. The legacy data (legacy) is first filtered to the extent of the PSU grid using the st_filter function. The st_coordinates function then extracts the spatial coordinates of the legacy points into a data frame (legacy_df) containing their x and y positions.

A loop iterates through each legacy point. Using the Euclidean distance formula, the script calculates the distance between each legacy point and all PSUs in the PSU.df data frame, which contains covariate information. The nearest PSU to each legacy point is identified using which.min, and its index is stored in the units vector. The covariate data corresponding to these fixed points is extracted and scaled (standardized). This step ensures the covariates have equal influence during clustering. The result is stored in the fixed object.

If legacy data exists, the Covariate Space Infill Sampling (CSIS) function is used to integrate these fixed legacy points into the clustering process. This ensures the legacy points remain as fixed centers while the distribution of new points is optimized.

If no legacy data is available, standard kmeans clustering is applied to the scaled covariates (mygrd), focusing solely on environmental variability to determine the cluster centers.

 if (exists("legacy")){
   legacy <- st_filter(legacy,psu_grid)
   legacy_df <- st_coordinates(legacy)
   units <- numeric(nrow(legacy_df))
   
   for (i in 1:nrow(legacy_df)) {
     distances <- sqrt((PSU.df$x - legacy_df[i, "X"])^2 + (PSU.df$y - legacy_df[i, "Y"])^2)
     units[i] <- which.min(distances)
   }
   
   fixed <- unique(data.frame(units, scale(PSU.df[, covs])[units, ])) 
   res <- CSIS(fixed = fixed, nsup = n.psu, nstarts = iterations, mygrd = mygrd)
   } else {
     res <- kmeans(mygrd, centers = n.psu, iter.max = 10000, nstart = 100)
   }
## [1] "1 out of 10"
## [1] "2 out of 10"
## [1] "3 out of 10"
## [1] "4 out of 10"
## [1] "5 out of 10"
## [1] "6 out of 10"
## [1] "7 out of 10"
## [1] "8 out of 10"
## [1] "9 out of 10"
## [1] "10 out of 10"

💡Note: Users can follow the progress in the printed output. For example, “1 out of 10” indicates that the first of 10 clustering iterations has completed, with 9 more to go.

This step associates each PSU with its respective cluster, as determined by the clustering algorithm. The cluster assignments from the CSIS function are stored in the PSU dataframe, allowing us to identify the cluster for each grid cell.

PSU.df$cluster <- res$cluster

This step computes the distances in covariate space between each cluster center and every PSU. rdist calculates the Euclidean distances between the cluster centers and the scaled covariate data. The apply function (over rows, MARGIN = 1) then identifies, for each cluster center, the PSU closest to it; these nearest units become the selected PSUs.

D <- rdist(x1 = res$centers, x2 = scale(PSU.df[, covs]))
units <- apply(D, MARGIN = 1, FUN = which.min)

The MSSSD measures how well the selected clusters represent the overall covariate space. dmin calculates the shortest distance from each point to a cluster center. MSSSD averages the squared shortest distances, providing a measure of clustering efficiency.

dmin <- apply(D, MARGIN = 2, min)
MSSSD <- mean(dmin^2)

This step extracts the PSUs selected during the clustering process, including their spatial coordinates and covariates. The myCSCsample dataframe holds only the selected PSUs, along with their spatial coordinates (x, y) and associated covariates.

myCSCsample <- PSU.df[units, c("x", "y", covs)]

Legacy PSUs are sampling points that already exist in the dataset, while new PSUs are points added during the clustering process. This chunk identifies and categorizes the selected PSUs into “legacy” and “new” types. If legacy data exists, it assigns the label legacy to pre-existing sampling points and new to additional points. If no legacy data exists, all selected PSUs are labeled as new.

if (exists("legacy")){
  myCSCsample$type <- c(rep("legacy", nrow(fixed)), rep("new", length(units)-nrow(fixed)))
  } else {
    myCSCsample$type <- "new"
  }

The selected PSUs (myCSCsample) are converted into a spatial object (sf class) with specified coordinates (x, y) and the defined coordinate reference system (epsg). This transformation allows for geospatial operations and visualization.

myCSCsample <-  myCSCsample %>%
  st_as_sf(coords = c("x", "y"), crs = epsg)

This step distinguishes between legacy PSUs (existing sampling points) and new PSUs (infill points). Legacy sampling points (type == "legacy") and new points (type == "new") are separated for further analysis and visualization.

if (exists("legacy")){
  legacy <-  myCSCsample[myCSCsample$type=="legacy",]
}

new <-  myCSCsample[myCSCsample$type=="new",] 

This step identifies the PSUs containing infill points and extracts their unique IDs. Intersecting the PSU grid with the infill points ensures only relevant PSUs are selected; the resulting dataframe contains their unique IDs (ID).

PSUs <- sf::st_intersection(psu_grid, new) %>% dplyr::select(ID)

The target.PSUs dataframe holds only the PSUs identified as sampling targets, extracted from psu_grid based on the intersection results.

target.PSUs <- psu_grid[psu_grid$ID %in% PSUs$ID,] %>% dplyr::select(ID)

Temporary objects (new, dmin, MSSSD) are deleted to free up memory and prevent conflicts.

rm(new,dmin,MSSSD)

Here, you can visualize how the selected PSUs are distributed across the first two principal components (PC1 and PC2) of the environmental covariates.

1.13 Computing SSUs and TSUs

In the previous section, we computed PSUs. Now, the focus shifts to SSUs and TSUs, which add hierarchical levels to the sampling process. SSUs are 100m x 100m grid cells within each PSU, selected based on high-resolution environmental covariates derived from the Shuttle Radar Topography Mission (SRTM) digital elevation data and Sentinel-2 multispectral imagery. The Google Earth Engine script for terrain derivatives can be accessed here, while the script for Sentinel-2 multispectral imagery indices is available here. The selection process involves the reallocation of SSUs within each PSU using CSC Sampling to optimise spatial representativeness based on environmental heterogeneity. TSUs are randomly distributed points within each SSU, providing finer-scale sampling resolution.

High-resolution environmental covariates provide fine-scale information that helps in selecting Secondary Sampling Units (SSUs). The following code loads a stack of high-resolution environmental covariates, crops it to the extent of the target PSUs, and applies a mask to ensure alignment.

# R code for creating the raster stack is named "prep_gee_highres_vars"
cov.dat.ssu <- terra::rast(paste0(raster.path, "hres_data/covs_stack_ssu_1ha_ZMB.tif"))

names(cov.dat.ssu)
##  [1] "B2"               "B3"               "B4"               "B5"              
##  [5] "B6"               "B7"               "B8"               "B8A"             
##  [9] "B11"              "B12"              "EVI"              "NDVI"            
## [13] "Brightness_Index" "NBRplus"          "NDBSI"            "Redness_Index"   
## [17] "VNSIR"            "sysisen_B2"       "sysisen_B3"       "sysisen_B4"      
## [21] "sysisen_B5"       "sysisen_B6"       "sysisen_B7"       "sysisen_B8"      
## [25] "sysisen_B8A"      "sysisen_B11"      "sysisen_B12"      "dem"             
## [29] "slope"            "aspect"           "hillshade"        "tpi"             
## [33] "geomorph_1"       "geomorph_2"       "geomorph_3"       "geomorph_4"      
## [37] "geomorph_5"       "geomorph_6"       "geomorph_7"       "geomorph_8"      
## [41] "geomorph_9"
cov.dat.ssu <- subset(cov.dat.ssu, 
                          names(cov.dat.ssu)[!names(cov.dat.ssu) %in% c("sysisen_B2", "sysisen_B3", "sysisen_B4", 
                                                                        "sysisen_B5","sysisen_B6", "sysisen_B7", 
                                                                        "sysisen_B8", "sysisen_B8A","sysisen_B11", 
                                                                        "sysisen_B12")]) 

names(cov.dat.ssu)
##  [1] "B2"               "B3"               "B4"               "B5"              
##  [5] "B6"               "B7"               "B8"               "B8A"             
##  [9] "B11"              "B12"              "EVI"              "NDVI"            
## [13] "Brightness_Index" "NBRplus"          "NDBSI"            "Redness_Index"   
## [17] "VNSIR"            "dem"              "slope"            "aspect"          
## [21] "hillshade"        "tpi"              "geomorph_1"       "geomorph_2"      
## [25] "geomorph_3"       "geomorph_4"       "geomorph_5"       "geomorph_6"      
## [29] "geomorph_7"       "geomorph_8"       "geomorph_9"
cov.dat.ssu[is.na(cov.dat.ssu)] <- 0

ggplot() +
  geom_raster(data = as.data.frame(cov.dat.ssu$NDVI, xy = TRUE), aes(x = x, y = y, fill = NDVI)) +
  scale_fill_viridis_c() +
  #scale_fill_gradientn(colours = colorspace::diverge_hcl(7)) +
  geom_sf(data = target.PSUs, color = "#101010", fill = NA, lwd = 0.5) +
  labs(title = "Target Primary Sampling Units",
       x = "Longitude",
       y = "Latitude",
       fill = "NDVI") +
  theme_minimal()

Empty lists are created to store the computed SSUs (selected_ssus) and TSUs (all_psus_tsus); each PSU will contribute its own set of SSUs and TSUs to these lists. The loop iterates over all selected PSUs, generating SSUs and TSUs for each. First, a grid of SSUs is created within the PSU boundary (st_make_grid), and only SSUs meeting the land-use coverage threshold (percent_crop) are retained. Next, a specified number of target and alternative SSUs (num_primary_ssus and num_alternative_ssus) are selected using covariate space coverage sampling based on a set of 23 high-resolution environmental covariates. Finally, for each SSU, TSUs are generated as simple random points (without replacement) within the SSU boundary and labeled as target or alternative.
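The loop below calls generate_tsu_points_within_ssu(), which is defined elsewhere in the project and not shown in this chapter. A minimal sketch of what such a helper might look like, under the assumption that TSUs are simple random points inside the SSU polygon, is:

```r
library(sf)

# Hypothetical sketch -- the project's actual helper may differ (for instance,
# it may also mask the points to cropland using the 'crops' raster passed in).
generate_tsu_points_within_ssu <- function(ssu, n_tsus, ssu_index, ssu_type, crops) {
  pts <- st_sample(ssu, size = n_tsus, type = "random")  # simple random points
  tsus <- st_sf(geometry = pts)
  tsus$SSU_ID   <- ssu_index                             # parent SSU identifier
  tsus$SSU_Type <- ssu_type                              # "Target" or "Replacement"
  tsus$TSU_ID   <- seq_len(nrow(tsus))                   # 1 = target, >1 = alternatives
  tsus
}
```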

selected_ssus <- list()

all_psus_tsus <- list()

for (psu_id in 1:nrow(target.PSUs)) {
      selected_psu <- target.PSUs[psu_id, ]
      
      # Generate SSUs within the selected PSU
      ssu_grid <- st_make_grid(selected_psu, cellsize = c(ssu_size, ssu_size), square = TRUE)
      ssu_grid_sf <- st_sf(geometry = ssu_grid)
      
      # Convert SSU grid to SpatVector
      ssu_grid_vect <- vect(ssu_grid_sf)
      
      # Extract land use values (LU) to filter SSUs
      extracted_values <- extract(crops, ssu_grid_vect, fun = table)
      
      # Compute the cropland percentage per SSU (a 100 m SSU contains 25 pixels of 20 m)
      ssu_grid_sf$lu <- (extracted_values[,2]*100)/25
      ssu_grid_sf <- ssu_grid_sf[ssu_grid_sf$lu > percent_crop, ]
      
      # Uncomment these two lines if you're using RGui or RStudio console
      # cat(sprintf("\rProgress: %.2f%% (%d out of %d)", (psu_id / nrow(target.PSUs)) * 100, psu_id, nrow(target.PSUs)))
      # flush.console()
      
      # Print dynamic progress update
      if (psu_id == nrow(target.PSUs)) {
        cat(sprintf("\rProgress: %.2f%% (%d out of %d)", (psu_id / nrow(target.PSUs)) * 100, psu_id, nrow(target.PSUs)))
        flush.console()
        }

      
      # Count available SSUs
      total_ssus <- nrow(ssu_grid_sf)
      
      if (total_ssus >= (num_primary_ssus + num_alternative_ssus)) {
        
        # Extract covariate values from raster stack for each SSU
        ssu_covariates <- terra::extract(cov.dat.ssu, vect(ssu_grid_sf), df = TRUE)
        
        # Merge extracted covariate values with SSU data
        ssu_data <- cbind(ssu_grid_sf, ssu_covariates[, -1])  # Drop first column (index)
        ssu_data_values <- st_drop_geometry(ssu_data)
        
        # Identify columns you do NOT want to scale
        exclude <- grep("^geomorph_|^lu$", names(ssu_data_values), value = TRUE)
        
        # Create two data.frames: one to scale, one to keep as-is
        to_scale <- ssu_data_values[, !names(ssu_data_values) %in% exclude]
        to_keep  <- ssu_data_values[, names(ssu_data_values) %in% exclude, drop = FALSE]
        
        # Drop columns with only NA
        to_scale <- to_scale[, colSums(!is.na(to_scale)) > 0, drop = FALSE]
        
        # Identify zero-variance columns (after removing NA-only columns)
        zero_variance_cols <- sapply(to_scale, function(x) sd(x, na.rm = TRUE) == 0)
        zero_variance_cols[is.na(zero_variance_cols)] <- TRUE  # Treat NA-sd columns as zero variance
        
        # Scale only columns with variance
        scaled_part <- to_scale
        if (any(!zero_variance_cols)) {
          scaled_part[, !zero_variance_cols] <- scale(to_scale[, !zero_variance_cols])
        }
        
        # Combine back
        mygrd_ssu <- cbind(to_keep, scaled_part)
        mygrd_ssu <- mygrd_ssu[, names(mygrd_ssu)]
        
        ### Determine the Optimal Number of Clusters for this PSU ###
        wss_values <- sapply(1:10, function(k) kmeans(mygrd_ssu[, -1], centers = k, nstart = 10)$tot.withinss)
        optimal_k <- which.max(diff(diff(wss_values))) + 1  # Elbow method: biggest drop in WSS
        # Override the elbow estimate so that exactly num_primary_ssus clusters are used
        optimal_k <- num_primary_ssus
        
        # Perform K-means clustering for this PSU
        kmeans_result <- kmeans(mygrd_ssu[, -1], centers = optimal_k, iter.max = 10000, nstart = 10)
        ssu_data$cluster <- kmeans_result$cluster  # Assign clusters
        #str(ssu_data)
        ssu_data$cluster <- as.factor(ssu_data$cluster)
        
        ### CSC Sampling: Select Closest SSUs to Cluster Centers ###
        # Compute distances from cluster centers to all SSUs
        D <- rdist(x1 = kmeans_result$centers, x2 = mygrd_ssu[, -1])  # Distance matrix
        # Select the SSU closest to each cluster center (Targets)
        target_units <- apply(D, MARGIN = 1, FUN = function(x) order(x)[1])  # Closest
        target_ssus <- ssu_data[target_units, ]
        
        # Select the SSU second closest to each cluster center (Replacements)
        replacement_units <- apply(D, MARGIN = 1, FUN = function(x) order(x)[2])  # Second closest
        replacement_ssus <- ssu_data[replacement_units, ]
        
        # Ensure replacements are in the same cluster as targets
        # Add type info
        target_ssus$SSU_Type <- "Target"
        replacement_ssus$SSU_Type <- "Replacement"
        
        # Assign SSU_IDs: 1–4 for Targets, 5–8 for Replacements
        target_ssus$SSU_ID <- 1:nrow(target_ssus)
        replacement_ssus$SSU_ID <- (nrow(target_ssus) + 1):(2 * nrow(target_ssus))
        
        # Match replacements to target IDs by cluster
        replacement_ssus$replacement_for <- sapply(replacement_ssus$cluster, function(cl) {
          matched <- target_ssus$SSU_ID[target_ssus$cluster == cl]
          if (length(matched) > 0) return(matched[1]) else return(NA)
        })
        
        # Add consistency for Targets
        target_ssus$replacement_for <- NA
        
        # Combine and store
        # Add PSU_ID to both sets
        psu_actual_id <- selected_psu$ID
        target_ssus$PSU_ID <- psu_actual_id
        replacement_ssus$PSU_ID <- psu_actual_id
        selected_ssus[[psu_id]] <- rbind(target_ssus, replacement_ssus)
        
        ### Generate TSUs for Primary and Alternative SSUs ###
        primary_tsus <- lapply(rownames(target_ssus), function(index) {
          generate_tsu_points_within_ssu(ssu_grid_sf[rownames(ssu_grid_sf) == index, ], number_TSUs, index, "Target", crops)
        })
        
        alternative_tsus <- lapply(rownames(replacement_ssus), function(index) {
          generate_tsu_points_within_ssu(ssu_grid_sf[rownames(ssu_grid_sf) == index, ], number_TSUs, index, "Replacement", crops)
        })
        
        # Combine all TSUs for the current PSU
        all_psus_tsus[[psu_id]] <- do.call(rbind, c(primary_tsus, alternative_tsus))
        
        
      } else {
        warning(paste("PSU", psu_id, "does not have enough SSUs for selection. Skipping."))
      }
}
## Progress: 100.00% (424 out of 424)

💡Note: By default, the loop prints a single progress line only when the final PSU is processed. To follow progress for every PSU, uncomment the cat()/flush.console() lines inside the loop; output such as "Progress: 3.33% (1 out of 30)" then indicates that the first of 30 PSUs has been processed.

Here, you can see how the SSUs are selected using CSC and the set of high-resolution remote sensing data.

After all TSUs have been generated, they are combined into a single dataset (all_tsus). The dataset includes metadata about each TSU’s type (Target or Alternative) and its hierarchical assignment to SSUs and PSUs.

all_ssus <- do.call(rbind, selected_ssus)
all_ssus <- all_ssus %>%
  mutate_at(vars(PSU_ID, SSU_ID), as.numeric)
    
all_tsus <- do.call(rbind, all_psus_tsus)
all_tsus <- all_tsus %>%
  mutate_at(vars(PSU_ID, SSU_ID), as.numeric)
    
all_tsus <- st_join(all_tsus, all_ssus[c("PSU_ID", "SSU_ID", "SSU_Type", "replacement_for")])
    
all_tsus <- all_tsus %>%
  select(PSU_ID = PSU_ID.x, SSU_ID = SSU_ID.y, SSU_Type = SSU_Type.y,
         Replacement_for = replacement_for, TSU_ID, geometry)
    
all_tsus$TSU_Type <- "Target"
all_tsus[all_tsus$TSU_ID >1,"TSU_Type"] <- "Alternative"
all_tsus$PSU_Type <- "Target"
    
all_tsus <- all_tsus %>%
  dplyr::select("PSU_ID", "SSU_ID", "SSU_Type", "Replacement_for", "TSU_ID", "TSU_Type", "geometry")

1.14 Exporting the target sampling units

This section covers the process of exporting target sampling units.

We start by converting the cluster information into raster format. Each cell corresponds to a cluster, and the coordinate reference system is assigned to ensure spatial consistency.

dfr <- PSU.df[,c("x","y","cluster")]
dfr$cluster <- as.numeric(dfr$cluster)
dfr <- rasterFromXYZ(dfr)
crs(dfr) = epsg

Here, the target PSUs are associated with their corresponding cluster IDs by extracting the cluster data.

PSU_cluster.id <- unlist(extract(dfr, target.PSUs))
    
valid.PSU_clusters <- target.PSUs %>% mutate(
  cluster = extract(dfr, target.PSUs, fun = mean, na.rm = TRUE)
  )

all.PSU_clusters <- psu_grid %>% mutate(
  cluster = extract(dfr, psu_grid, fun = mean, na.rm = TRUE)
  )

all.PSU_clusters <- na.omit(all.PSU_clusters)

Cluster information from the PSUs is joined to the TSUs for identification and analysis.

valid.PSU_clusters <- valid.PSU_clusters %>%
  rename(Replace_ID= cluster) 

all_tsus <- st_join(all_tsus, valid.PSU_clusters)

Then, an order is assigned to the TSUs within each PSU. The sampling order differentiates between target TSUs (1–4) and alternative TSUs (5–7).

all_tsus <- all_tsus %>%
  group_by(PSU_ID) %>%
  mutate(order = match(SSU_ID, unique(SSU_ID))) %>%
  ungroup()
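On a toy vector, the match(SSU_ID, unique(SSU_ID)) idiom assigns consecutive order numbers in order of first appearance:

```r
ssu_ids <- c(3, 3, 7, 7, 2)               # illustrative SSU IDs within one PSU
match(ssu_ids, unique(ssu_ids))           # 1 1 2 2 3
```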

Each TSU is assigned a unique site ID for traceability. The ID is constructed using the country code, PSU ID, SSU order, and TSU ID.

all_tsus$site_id = paste0(ISO.code, sprintf("%04d", all_tsus$PSU_ID), "-", all_tsus$SSU_ID, "-", all_tsus$TSU_ID, "C")
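With hypothetical values, the resulting site IDs look like this (the country code and IDs below are illustrative):

```r
ISO.code <- "ZMB"                          # illustrative country code
paste0(ISO.code, sprintf("%04d", 17), "-", 2, "-", 1, "C")
# "ZMB0017-2-1C"
```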

Finally, target PSUs and TSUs, along with the cluster raster, are saved as shapefiles and raster files for further use.

write_sf(valid.PSU_clusters,paste0(results.path,"/PSUs_target.shp"), overwrite=TRUE)

write_sf(all_tsus,paste0(results.path,"/TSUs_target.shp"), overwrite=TRUE) 

write_sf(all.PSU_clusters,paste0(results.path,"/PSU_pattern_cl.shp"), overwrite=TRUE)
 
writeRaster(dfr, paste0(results.path,"/clusters.tif"), overwrite=TRUE)

1.15 Computing alternative PSUs

This step filters out PSUs already used in valid sampling units, leaving only the remaining PSUs for potential replacements.

remaining.PSU_clusters <- all.PSU_clusters %>%
  filter(!(ID %in% valid.PSU_clusters$ID))

Distinct cluster values from the valid PSUs are identified to guide replacement sampling.

unique_cluster <- distinct(valid.PSU_clusters, Replace_ID)$Replace_ID

A vector is initialized to store indices of the replacement PSUs.

sampled_indices <- integer(0)

The loop iterates through each unique cluster, identifying and sampling potential replacements from the remaining PSUs.

for (clust in unique_cluster) {
  candidates_indices <- which(remaining.PSU_clusters$cluster == clust)
  
  if (length(candidates_indices) > 0) {
    sampled_index <- sample(candidates_indices, size = 1)
    sampled_indices <- c(sampled_indices, sampled_index)
  }
}

The selected replacement indices are used to create a dataset of replacement PSUs.

replacements <- remaining.PSU_clusters[sampled_indices, ]

1.16 Computing alternative SSUs and TSUs for alternative PSUs

Alternative SSUs and TSUs are generated for the alternative PSUs using the same methods as for the target PSUs. This ensures a backup is available in case a PSU cannot be sampled in the field, so the step-by-step explanation is not repeated here.

# Initialize a list to store TSUs for all PSUs
alt_psus_tsus_sf <- list()
selected_ssus_sf <- list()

for (psu_id in 1:nrow(replacements)) {
  selected_psu <- replacements[psu_id, ]
  
  ssu_grid <- st_make_grid(selected_psu, cellsize = c(ssu_size, ssu_size), square = TRUE)
  ssu_grid_sf <- st_sf(geometry = ssu_grid)
  
  ssu_grid_vect <- vect(ssu_grid_sf)
  
  extracted_values <- extract(crops, ssu_grid_vect, fun = table)
  # Compute the cropland percentage per SSU (a 100 m SSU contains 25 pixels of 20 m)
  ssu_grid_sf$lu <- (extracted_values[,2]*100)/25
  ssu_grid_sf <- ssu_grid_sf[ssu_grid_sf$lu > percent_crop, ]
      
  # Print dynamic progress update
  if (psu_id == nrow(replacements)) {
    cat(sprintf("\rProgress: %.2f%% (%d out of %d)", (psu_id / nrow(replacements)) * 100, psu_id, nrow(replacements)))
    flush.console()
    }
 
  # Count available SSUs
  total_ssus <- nrow(ssu_grid_sf)
      
  if (total_ssus >= (num_primary_ssus + num_alternative_ssus)) {
        
  # Extract covariate values from raster stack for each SSU
  ssu_covariates <- terra::extract(cov.dat.ssu, vect(ssu_grid_sf), df = TRUE)
        
  # Merge extracted covariate values with SSU data
  ssu_data <- cbind(ssu_grid_sf, ssu_covariates[, -1])  # Drop first column (index)
  ssu_data_values <- st_drop_geometry(ssu_data)
  
  # Identify columns you do NOT want to scale
  exclude <- grep("^geomorph_|^lu$", names(ssu_data_values), value = TRUE)
  # Create two data.frames: one to scale, one to keep as-is
  to_scale <- ssu_data_values[, !names(ssu_data_values) %in% exclude]
  to_keep  <- ssu_data_values[, names(ssu_data_values) %in% exclude, drop = FALSE]
  # Drop columns with only NA
  to_scale <- to_scale[, colSums(!is.na(to_scale)) > 0, drop = FALSE]
      
  # Identify zero-variance columns (after removing NA-only columns)
  zero_variance_cols <- sapply(to_scale, function(x) sd(x, na.rm = TRUE) == 0)
  zero_variance_cols[is.na(zero_variance_cols)] <- TRUE  # Treat NA-sd columns as zero variance
        
  # Scale only columns with variance
  scaled_part <- to_scale
  if (any(!zero_variance_cols)) {
    scaled_part[, !zero_variance_cols] <- scale(to_scale[, !zero_variance_cols])
    }
        
  # Combine back
  mygrd_ssu <- cbind(to_keep, scaled_part)
  mygrd_ssu <- mygrd_ssu[, names(mygrd_ssu)]
        
  ### Determine the number of clusters for this PSU ###
  # Elbow heuristic (kept for reference): largest change in the decrease of WSS
  wss_values <- sapply(1:10, function(k) kmeans(mygrd_ssu[, -1], centers = k, nstart = 10)$tot.withinss)
  optimal_k <- which.max(diff(diff(wss_values))) + 1
  # Override: fix the cluster count at num_primary_ssus so every PSU yields the same number of SSUs
  optimal_k <- num_primary_ssus
        
  # Perform K-means clustering for this PSU
  kmeans_result <- kmeans(mygrd_ssu[, -1], centers = optimal_k, iter.max = 10000, nstart = 10)
  ssu_data$cluster <- kmeans_result$cluster  # Assign clusters
  #str(ssu_data)
  ssu_data$cluster <- as.factor(ssu_data$cluster)
        
  ### CSC Sampling: Select Closest SSUs to Cluster Centers ###
  # Compute distances from cluster centers to all SSUs
  D <- rdist(x1 = kmeans_result$centers, x2 = mygrd_ssu[, -1])  # Distance matrix
  # Select the SSU closest to each cluster center (Targets)
  target_units <- apply(D, MARGIN = 1, FUN = function(x) order(x)[1])  # Closest
  target_ssus <- ssu_data[target_units, ]
        
  # Select the SSU second closest to each cluster center (Replacements)
  replacement_units <- apply(D, MARGIN = 1, FUN = function(x) order(x)[2])  # Second closest
  replacement_ssus <- ssu_data[replacement_units, ]
        
  # Ensure replacements are in the same cluster as targets
  # Add type info
  target_ssus$SSU_Type <- "Target"
  replacement_ssus$SSU_Type <- "Replacement"
        
  # Assign SSU_IDs: 1–4 for Targets, 5–8 for Replacements
  target_ssus$SSU_ID <- 1:nrow(target_ssus)
  replacement_ssus$SSU_ID <- (nrow(target_ssus) + 1):(2 * nrow(target_ssus))
        
  # Match replacements to target IDs by cluster
  replacement_ssus$replacement_for <- sapply(replacement_ssus$cluster, function(cl) {
  matched <- target_ssus$SSU_ID[target_ssus$cluster == cl]
  if (length(matched) > 0) return(matched[1]) else return(NA)})
        
  # Add consistency for Targets
  target_ssus$replacement_for <- NA
        
  # Combine and store
  # Add PSU_ID to both sets
  psu_actual_id <- selected_psu$ID
  target_ssus$PSU_ID <- psu_actual_id
  replacement_ssus$PSU_ID <- psu_actual_id
  selected_ssus_sf[[psu_id]] <- rbind(target_ssus, replacement_ssus)
        
  ### Generate TSUs for Primary and Alternative SSUs ###
  primary_tsus <- lapply(rownames(target_ssus), function(index) {
    generate_tsu_points_within_ssu(ssu_grid_sf[rownames(ssu_grid_sf) == index, ], number_TSUs, index, "Target", crops)
    })
        
  alternative_tsus <- lapply(rownames(replacement_ssus), function(index) {
    generate_tsu_points_within_ssu(ssu_grid_sf[rownames(ssu_grid_sf) == index, ], number_TSUs, index, "Replacement", crops)
    })
        
  # Combine all TSUs for the current PSU
  alt_psus_tsus_sf[[psu_id]] <- do.call(rbind, c(primary_tsus, alternative_tsus))
  
  } else {
    warning(paste("PSU", psu_id, "does not have enough SSUs for selection. Skipping."))
  }
}
## Progress: 100.00% (387 out of 387)
# Combine TSUs from all PSUs into one sf object
all_ssus_combined_sf <- do.call(rbind, selected_ssus_sf)
all_ssus_combined_sf <- all_ssus_combined_sf %>%
  mutate_at(vars(PSU_ID, SSU_ID), as.numeric)

alt_tsus_combined_sf <- do.call(rbind, alt_psus_tsus_sf)
alt_tsus_combined_sf <- alt_tsus_combined_sf %>%
  mutate_at(vars(PSU_ID, SSU_ID), as.numeric)

alt_tsus_combined_sf <- st_join(alt_tsus_combined_sf, all_ssus_combined_sf[c("PSU_ID", "SSU_ID", "SSU_Type", "replacement_for")])

alt_tsus_combined_sf <- alt_tsus_combined_sf %>%
  select(PSU_ID = PSU_ID.x, SSU_ID = SSU_ID.y, SSU_Type = SSU_Type.y,
         Replacement_for = replacement_for, TSU_ID, geometry)

alt_tsus_combined_sf$TSU_Type <- "Target"
alt_tsus_combined_sf[alt_tsus_combined_sf$TSU_ID > 1, "TSU_Type"] <- "Alternative"
alt_tsus_combined_sf$PSU_Type <- "Target"

alt_tsus_combined_sf <- alt_tsus_combined_sf %>%
  dplyr::select("PSU_ID", "SSU_ID", "SSU_Type", "Replacement_for", "TSU_ID", "TSU_Type", "geometry")

1.17 Exporting the alternative sampling units

This section covers the process of exporting the alternative sampling units.

Cluster information from the alternative PSUs is joined to the TSUs for identification and analysis.

replacements <- replacements %>%
  rename(Replace_ID = cluster)

alt_tsus_combined_sf <- st_join(alt_tsus_combined_sf, replacements)

Then, an order is assigned to the TSUs within each alternative PSU, based on the first appearance of each SSU. The sampling order differentiates between target SSUs (1–4) and alternative SSUs (5–8).

alt_tsus_combined_sf <- alt_tsus_combined_sf %>%
  group_by(PSU_ID) %>%
  mutate(order = match(SSU_ID, unique(SSU_ID))) %>%
  ungroup()
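The match(SSU_ID, unique(SSU_ID)) idiom produces a dense ordering by first appearance, so repeated SSU_IDs within a PSU share the same order value. A standalone illustration:

```r
# Order by first appearance: duplicated IDs share the same rank
ssu_ids <- c(5, 5, 7, 1, 7)
match(ssu_ids, unique(ssu_ids))
## [1] 1 1 2 3 2
```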

Each TSU is assigned a unique site ID for traceability. The ID is constructed from the country code, the zero-padded PSU ID, the SSU ID, and the TSU ID, followed by a "C" suffix.

alt_tsus_combined_sf$site_id = paste0(ISO.code, sprintf("%04d", alt_tsus_combined_sf$PSU_ID), "-", alt_tsus_combined_sf$SSU_ID, "-", alt_tsus_combined_sf$TSU_ID, "C")
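For instance, with a hypothetical ISO.code of "GH" (the actual code is set earlier in the script), a TSU with PSU_ID 12, SSU_ID 5, and TSU_ID 2 receives the following site ID:

```r
# Hypothetical values for illustration only; ISO.code is defined earlier in the script
paste0("GH", sprintf("%04d", 12), "-", 5, "-", 2, "C")
## [1] "GH0012-5-2C"
```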

Finally, alternative PSUs and TSUs are saved as shapefiles for further use.

write_sf(replacements,paste0(results.path,"/PSUs_replacements.shp"), overwrite=TRUE)

write_sf(alt_tsus_combined_sf,paste0(results.path,"/TSUs_replacements.shp"), overwrite=TRUE)

1.18 Final outputs

This section summarizes how to process and export information about PSUs, including counting PSUs per cluster, joining valid and remaining PSUs for availability analysis, and saving the final outputs as shapefiles.

The number of available PSUs in each cluster is counted for the valid PSUs. Here, valid.PSU_clusters is the dataset containing the valid PSUs; group_by() aggregates the counts by Replace_ID (the unique cluster identifier), and summarise(Count = n()) calculates the total number of PSUs in each cluster.

valid_counts <- valid.PSU_clusters %>%
  group_by(Replace_ID) %>%
  summarise(Count = n())

A similar operation for the remaining (alternative) PSUs calculates how many alternative PSUs are available in each cluster.

remaining_counts <- remaining.PSU_clusters %>%
  group_by(cluster) %>%
  summarise(Count = n())

The st_join() function combines the two datasets so that the valid and remaining PSU counts can be compared per cluster. Note that st_join() joins by geometry, not by attribute; the suffix argument distinguishes the two Count columns. The shapefile is saved to the results folder (results.path).

availability <- st_join(valid_counts, remaining_counts, suffix = c("_valid", "_remaining"))

write_sf(availability,paste0(results.path,"/availability.shp"), overwrite=TRUE)

Last but not least, a single file is created to merge the shapefiles for target PSUs (PSUs_target.shp) and replacement PSUs (PSUs_replacements.shp).

tar  <- sf::st_read(file.path(results.path, "PSUs_target.shp"))
repl <- sf::st_read(file.path(results.path, "PSUs_replacements.shp"))

crop_PSU <- rbind(tar,repl)

crop_PSU <- crop_PSU %>%
  group_by(Replace_ID) %>%
  summarize(geometry = st_union(geometry))

write_sf(crop_PSU,paste0(results.path,"/target_repl_PSUs.shp"), overwrite=TRUE)
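The group_by()/summarize(st_union()) pattern above dissolves all PSU polygons that share a Replace_ID into a single feature. A minimal self-contained sketch of the dissolve, in which two adjacent unit squares with the same Replace_ID merge into one feature:

```r
library(sf)
library(dplyr)

# Two adjacent unit squares sharing Replace_ID = 1
sq <- function(x0) st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0),
                                         c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
d <- st_sf(Replace_ID = c(1, 1), geometry = st_sfc(sq(0), sq(1)))

dissolved <- d %>%
  group_by(Replace_ID) %>%
  summarize(geometry = st_union(geometry))

nrow(dissolved)
## [1] 1
```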